A Tool for Detecting Plagiarism in Pascal Programs
Thesis  directed  by  Professor  Lloyd  D.  Fosdick 

Plagiarism has become a problem in introductory
Computer Science courses. Programmed assignments can be
copied and transformed with little human effort. A
pertinent recommendation has resulted from this
realization: an on-line system to detect programs that are
"too similar" and hence suspected of plagiarism should be
developed [5]. The purpose of this thesis has been to
construct such a system in the form of Program Accuse.

Program Accuse analyzes Pascal programs to detect
those pairs of programs such that plagiarism is a
possibility.

An overriding concern of the development of Accuse
has been that it be inexpensive to use. In addition, the
use of Accuse is intended for introductory Computer
Science courses. The result is a program that is
efficient, but limited in its ability to detect sophisticated
plagiarism. Efficiency means low cost; lack of
comprehensive analysis is rationalized with the assumption
that the student clever enough to plagiarize with
sophistication has no need to plagiarize.

Accuse measures 20 parameters in each program:
for example, total lines in the program, variables
declared and not used, and the number of control
statements. Seven of these parameters were chosen through
testing as a means to compute a correlation number that
determines if two programs are similar.

If two programs are considered similar, they are
flagged for the user to inspect and make the judgement
as to whether plagiarism occurred.
CHAPTER I


INTRODUCTION 


Plagiarism has become a problem in introductory
Computer Science courses. Programmed assignments can be
copied and transformed with little human effort. A
pertinent recommendation has resulted from this
realization: an on-line system to detect programs that are
"too similar" and hence suspected of plagiarism should be
developed [5]. The purpose of this thesis has been to
construct such a system in the form of Program Accuse.

Program Accuse analyzes Pascal programs to detect
those pairs of programs such that plagiarism is a
possibility.

An overriding concern of the development of Accuse
has been that it be inexpensive to use. In addition, the
use of Accuse is intended for introductory Computer
Science courses. The result is a program that is
efficient, but limited in its ability to detect sophisticated
plagiarism. Efficiency means low cost; lack of
comprehensive analysis is rationalized with the assumption
that the student clever enough to plagiarize with
sophistication has no need to plagiarize.

Accuse measures 20 parameters in each program:
for example, total lines in the program, variables
declared and not used, and the number of control
statements. Seven of these parameters were chosen through
testing as a means to compute a correlation number that
determines if two programs are similar.

If  two  programs  are  considered  similar,  they  are 
flagged  for  the  user  to  inspect  and  make  the  judgement 
as  to  whether  plagiarism  occurred. 


CHAPTER  II 


BACKGROUND 


An attempt to construct such an on-line system
has been made at Purdue University by K.J. Ottenstein [4].
He developed a program that quantifies the sameness of
Fortran programs using the four basic Software Science
parameters suggested by M. Halstead as useful measures of
program length [3]. These parameters are: (1) the number
of unique operators, (2) the number of unique operands,
(3) the total number of occurrences of operators, and (4)
the total number of occurrences of operands. It seems
the first suggestion to use these parameters as measures
of similarity or dissimilarity (depending on your
viewpoint) came from N. Bulut as a by-product of his study
of invariant properties of algorithms [1].

M.  Halstead  developed  the  notion  of  Software 
Science  in  1972.  He  advances  the  four  parameters  above 
as  properties  of  any  computer  program  that  are  capable  of 
being  counted  or  measured.  He  defines  these  parameters 
and their relationships as follows [3]:

n1 = number of unique operators
n2 = number of unique operands
N1 = total number of operators
N2 = total number of operands
vocabulary n = n1 + n2
length N = N1 + N2

He also provides data to support the following
relationship [3]:

N = n1 log n1 + n2 log n2
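The definitions above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample counts are hypothetical, and the logarithm is assumed to be base 2, as in Halstead's usual formulation (the text above writes only "log").

```python
import math

def halstead_measures(n1, n2, N1, N2):
    """Return (vocabulary, length, estimated length) from the four basic counts."""
    vocabulary = n1 + n2          # n = n1 + n2
    length = N1 + N2              # N = N1 + N2
    # Length relationship; base-2 logarithm is an assumption of this sketch.
    estimated_length = n1 * math.log2(n1) + n2 * math.log2(n2)
    return vocabulary, length, estimated_length

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
vocab, length, estimate = halstead_measures(n1=8, n2=16, N1=40, N2=50)
```

With these made-up counts the observed length (90) and the estimated length (88) are close, which is the kind of agreement the relationship is meant to capture.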

Ottenstein's program utilizes only the four basic
Software Science parameters, and it counts them in a
straightforward manner. He acknowledges his program
detects only cosmetic changes: reordering time
independent statements, recommenting, reformatting of text,
and renaming variables and labels. He believes that
plagiarism can be deterred both by the knowledge of the
existence of a program like his and its ability to make
it reasonably difficult to cheat successfully [4].

Ottenstein uses the length N to categorize his
input programs. Those that have identical N1, N2, n1,
and n2 counts are then suspected of plagiarism [4].
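The matching rule just described can be illustrated as follows. This is a sketch of the idea, not Ottenstein's Fortran implementation: the function name, the program names, and the counts are all hypothetical. Programs are grouped on the full tuple (n1, n2, N1, N2), and programs with identical tuples necessarily share the same length N.

```python
from collections import defaultdict
from itertools import combinations

def suspect_pairs(counts):
    """counts maps a program name to its (n1, n2, N1, N2) tuple.

    Returns every pair of programs whose four counts are identical.
    """
    groups = defaultdict(list)
    for name, four_counts in counts.items():
        groups[four_counts].append(name)
    pairs = []
    for names in groups.values():
        pairs.extend(combinations(sorted(names), 2))
    return sorted(pairs)

# Hypothetical input: A and B match exactly; C has the same N (75) but
# different individual counts, so it is not flagged.
example = {
    "A": (10, 12, 40, 35),
    "B": (10, 12, 40, 35),
    "C": (11, 12, 40, 35 - 1 + 1 - 1),
}
flagged = suspect_pairs(example)
```

Because the match must be exact, a single added or removed operand is enough to escape detection, which is the limitation Accuse's window sizes are meant to soften.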

Inherent in M. Halstead's theories is the
assumption that programs are well-written and polished. For
example, in almost all cases for which the length
indicator (N) was tested, the programs had been prepared for
publication [2].

M.  Halstead  recognized  that  not  all  programs 
would  be  well-written,  and  hence  derived  and  defined  six 
classes of impurities as follows [3]:
(1) complementary operations: the successive
application of two complementary operators to the same
operand

example:  R:=T*T+T-T 

(2) ambiguous operands: the same operand name is
used to represent two or more variables within a program

example: R := P + Q; R := R * R

(3)  synonymous  operands:  using  two  operand  names 
to  represent  the  same  variable  within  a  program 

example:  T1  :=  P  +  Q;  T2  :=  P  +  Q;  R  :=  T1  +  T2 

(4)  common  subexpressions:  the  same  subexpression 
occurs  more  than  one  time  within  a  program 

example:  R  :=  (P*Q)  +  (P*Q) 

(5)  unwarranted  assignment:  an  expression  is 
assigned  to  a  temporary  operand  that  is  used  only  once 

example: T := P + Q; R := T;

(6) unfactored expressions: the same operators and
operands repeat in an expression (making the expression
difficult to understand)

example:  R:=P*P+2*P*Q+Q*Q 

Fitzsimmons and Love conjecture that a compiler
can detect all of these impurities [2], and only (6)
above cannot be mechanically corrected [3]. Any system
that attempts to detect plagiarism can expect to encounter
these impurities.




CHAPTER  III 


DESIGN  OF  ACCUSE 


Two principal ideas guided the development of
Accuse: (1) that Accuse be as inexpensive to use as
possible, and (2) that the individual able enough to
plagiarize cleverly has no need to plagiarize.

When construction of Accuse was being planned,
the idea of using the front end of a compiler as the
driver was considered. There were two reasons for
this: (1) the desire to use as much off-the-shelf material
as possible, and (2) the lack of awareness of Software
Science for this particular application.

After the discovery of Ottenstein's attempt and
his method, it was felt that a counter could be written
that would be faster than even a stripped down compiler.
However, because Accuse is not a compiler, it needs to be
used in the context of a larger tool that retrieves
programs, compiles them and saves their output for graders,
and then sends them to a file for processing by Accuse.

The result of not using a compiler is a
compromise between speed and comprehensive analysis. Accuse
processes over 170 lines per second. However, as noted
above, it will not discover changes made by the
sophisticated plagiarist. Again, this is rationalized with
the assumption that the student intelligent enough to
plagiarize with sophistication has no need to plagiarize.

Accuse was designed top-down, but implemented
from the bottom up. Each module was developed as needed;
for while we knew the main components of the system, it
was impossible to predict the support routines. A
module's ability to achieve the desired counts was
certified before construction of the next module.

Program Accuse was constructed with the belief
that additional parameters are available beyond the four
basic Software Science parameters, and that heuristics
can be employed to achieve more than detection of cosmetic
changes. Using these heuristics and seven parameters,
Accuse computes a correlation number that is used to
determine the similarity of two programs.

Accuse measures 20 parameters. The seven that
comprise the correlation number were selected by testing
different combinations of them.

Accuse measures the following 20 parameters (for
full definitions see Appendix A):

1. total lines
2. code lines
3. code comment lines
4. multiple statement lines
5. constants and types
6. variables declared (and used)
7. variables declared (and not used)
8. procedures and functions
9. var parameters
10. value parameters
11. procedure variables (includes 9 and 10)
12. for statements
13. repeat statements
14. while statements
15. goto statements
16. unique operators
17. unique operands
18. total operators
19. total operands
20. indenting function

The seven parameters that comprise the correlation
number are:

1. unique operators
2. unique operands
3. total operators
4. total operands
5. code lines
6. variables declared (and used)
7. total control statements

While Accuse was being constructed, it was believed that
an "indenting function" would play an important role in the
detection of plagiarism. Since Computer Science 210
students use cards and do not have access to the
sophisticated editing features of a time sharing terminal, it
was thought that changes to the style of a copied
program would be clumsy at best. This resulted in the
rejection of any sophisticated indenting functions and
the selection of a simple one. The function currently
counts the number of left, right, and unindented lines of
code. The indenting function is created as follows:

indenting function =
    ((left indentations) mod 1000) * 1000000 +
    ((right indentations) mod 1000) * 1000 +
    ((zero indentations) mod 1000)
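The packing above can be sketched in Python. The function and argument names are ours, and the three counts are taken as already measured; the sample values are hypothetical. Each count occupies three decimal digits of the result, so a count of 1000 or more wraps around.

```python
def indenting_function(left, right, zero):
    """Pack three line counts into one integer, three decimal digits each."""
    return (left % 1000) * 1_000_000 + (right % 1000) * 1_000 + (zero % 1000)

# Hypothetical counts: 12 left-indented, 34 right-indented, 5 unindented lines.
packed = indenting_function(12, 34, 5)

# A count of 1012 is indistinguishable from 12 after the mod.
wrapped = indenting_function(1012, 34, 5)
```

Since the left count sits in the most significant digits, sorting on this single value orders programs first by left indentations, then by right, then by zero indentations.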

The results have proved disappointing. If all of
the input programs were processed through a "pretty
printer," an indenting function might become important.
This additional cost is presently considered prohibitive,
and it is contrary to the intent of Accuse being
inexpensive to use. The unimportance of an indenting function
necessitated the search for an alternate parameter that
would reflect some characteristic of the lines of a
program. The result was the idea to count lines of
executable code in a program, and the results of this decision
are thus far promising.

The decision to introduce heuristics into the way
counts are made in Accuse had a two-fold purpose: (1)
to make plagiarism difficult to achieve, and (2) to make
Accuse's repeated use feasible in light of the fact that
its use will quickly become common knowledge. The
heuristics are simple and straightforward.

"Total operators" does not include assignment
operators. In addition, for every assignment operator
found, two operands are subtracted from "total
operands," and "code lines" is decremented. The purpose
of this is to prevent Accuse from being misled by
unnecessary initializations and unnecessary assignment
statements. This desire roughly correlates to the
prevention of M. Halstead's fifth defined impurity,
"unwarranted assignments."

"Code  lines"  ignores  blank  lines,  comment  lines, 
and  declarations.  It  counts  only  the  lines  of  executable 
code  within  a  program.  This  is  intended  to  prevent 
excess  declarations  and  comments  from  affecting  this 
parameter's  value. 
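The assignment heuristic described above amounts to a simple adjustment of three counts. The sketch below is illustrative only: the function name and the numbers are hypothetical, and it assumes the raw counts were taken with each assignment operator counted as one operator.

```python
def adjust_for_assignments(total_operators, total_operands, code_lines, assignments):
    """Apply the assignment heuristic to raw counts.

    Assumption of this sketch: the raw operator count includes every ":=".
    """
    return (
        total_operators - assignments,     # assignment operators are not counted
        total_operands - 2 * assignments,  # two operands dropped per assignment
        code_lines - assignments,          # one code line dropped per assignment
    )

# Hypothetical raw counts for a program containing 10 assignments.
adjusted = adjust_for_assignments(50, 80, 30, 10)
```

Under this adjustment, padding a copied program with extra initializations such as X := 0 leaves all three counts unchanged, which is the stated purpose of the heuristic.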

Accuse is also selective about what it calls
operators. A "BEGIN END" combination and "()"
combination are considered operators in Software Science.
Because BEGINs, ENDs, and parentheses can be added to
Pascal code where not required, Accuse chooses to ignore
them. A semicolon is ignored for essentially the same
reason. IF is considered an operator while THEN is not.
ELSE is considered an operator because it is not a
necessary part of an IF statement.
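The selection rules above can be pictured as a filter over a token stream. The sketch below is not the Accuse scanner: the token sets are an illustrative subset chosen by us, the input is assumed to be already split into tokens, and ignoring ":=" for the unique-operator count as well as the total is our assumption.

```python
# Tokens Accuse is described as ignoring, plus ":=" (excluded from "total operators").
IGNORED_TOKENS = {"BEGIN", "END", "(", ")", ";", "THEN", ":="}

# Illustrative subset of tokens counted as operators; not a full Pascal list.
COUNTED_OPERATORS = {"IF", "ELSE", "+", "-", "*", "/", "=", "<>", "<", ">", "<=", ">="}

def count_operators(tokens):
    """Return (unique operators, total operators) for a list of token strings."""
    seen = set()
    total = 0
    for token in tokens:
        key = token.upper()
        if key in IGNORED_TOKENS:
            continue
        if key in COUNTED_OPERATORS:
            seen.add(key)
            total += 1
    return len(seen), total

# if (a > b) then begin c := a - b; end else c := b - a
tokens = ["if", "(", "a", ">", "b", ")", "then", "begin", "c", ":=", "a", "-",
          "b", ";", "end", "else", "c", ":=", "b", "-", "a"]
unique_ops, total_ops = count_operators(tokens)
```

Removing the redundant begin, end, and parentheses from the sample statement leaves both counts unchanged, which is why such padding does not disturb the comparison.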


As Accuse only counts variables, the obvious
tactic of changing variable names makes no difference to
Accuse. Since Pascal requires declarations, Accuse can
keep track of variables declared and subsequently used
or not used. Hence, declaring extra variables and then
not using them does not affect Accuse's analysis.
Constants of enumerated types and tag fields in case
clauses of record declarations that contain a
declaration are considered variables. Since these constants
cannot be read or written, their nonuse is considered
notable.


CHAPTER  IV 


SHORTCOMINGS 

Accuse  has  three  main  drawbacks.  The  first  is 
that  it  is  unable  to  detect  five  of  the  impurities 
defined  by  Halstead.  This  may  in  fact  not  be  that 
critical;  for  any  system  to  detect  and  then  "undo"  any 
impurities  once  found  would  at  the  least  be  expensive; 
in  addition,  the  individual  we  wish  to  catch  plagiarizing 
is  not  likely  to  introduce  these  impurities. 

The  second  is  that  because  the  input  program  is 
not  parsed,  but  is  guided  by  a  driver  that  expects  a 
compilable  program,  syntactically  incorrect  programs  may 
be  accepted  by  Accuse.  Accuse  uses  a  modified  Pascal 
scanner, specifically the Pascal-J scanner made available
to  students  at  the  University  of  Colorado  for  graduate 
work.  Hence  it  detects  some  syntax  errors:  for  example, 
incorrect  literal  strings  and  comments  that  lack  their 
left  part.  However,  it  may  very  well  accept  syntactically 
incorrect  programs. 

The final drawback is that since the current
policy in conjunction with the use of Accuse does not
include the user making the students' "graded runs,"
there is nothing to prevent a student from changing or
sabotaging his program before he submits it for processing
by Accuse. The cost of rerunning all students'
programs is presently considered prohibitive, and checking
every student's final listing against an unordered
listing of 150 programs is impractical.

The first drawback is not a detriment if grading
enforces a policy that does not allow these impurities
by exacting a severe penalty for their use.

The  second  and  third  are  resolved  if  Accuse  is 
used  in  the  context  of  a  larger  tool. 
CHAPTER V

OUTPUT 


Accuse prints four results for the user. The
first is a dump of each program's identifier and its
values of the 20 parameters measured by Accuse. This
dump is sorted on the "indenting function."

The second result is a dump of each program's
identifier  and  its  respective  values  of  the  seven 
parameters  used  to  compute  the  correlation  number;  each 
parameter  list  is  sorted  smallest  to  largest.  In  the 
output,  the  column  headed  FOR  STMT  actually  contains  the 
total  number  of  control  statements.  This  is  the  result  of 
the  implementation  of  summing  parameters. 

The  third  result  is  a  frequency  distribution 
graph  that  indicates  the  number  of  pairs  of  programs  with 
like  correlation  numbers.  A  new  addition  to  the  listings 
is  the  Tukey  estimate  for  suspicion  of  plagiarism. 

The final result is a list of all pairs of
programs with correlation number greater than or equal to
28. Twenty-nine is currently identified as the number
that indicates the possibility of plagiarism, with 32
the maximum correlation number possible.


CHAPTER  VI 


DEFINING  THE  CORRELATION  SCHEME 

The scheme that computes the correlation number
is  only  a  tentative  one.  The  current  scheme  was 
developed  and  tuned  by  using  a  group  of  43  programs  from 
an  introductory  course.  Code  for  three  of  the  programs 
was  written  together,  but  finished  individually.  The 
"importance"  values  for  the  seven  correlation  parameters 
were  then  adjusted  until  these  three  programs  were 
brought into the domain of "those programs suspected of
plagiarism."

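The current correlation scheme involves computing
an increment for each pair of affected programs based on
the equation

increment = "importance" - (pcounta - pcountb)

where pcounta and pcountb represent parameter counts and
(pcounta - pcountb) is less than or equal to some "window"
size depending on the particular parameter. The difference
is taken as the larger count minus the smaller, and a pair
whose difference exceeds the window is not "affected" and
receives no increment for that parameter.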

The  computation  of  the  correlation  number  may 
well  be  subject  to  improvement  by  a  more  elaborate  scheme 
or  by  simple  changes  to  the  importance  factors. 

Five runs of Accuse follow the text of this paper.
The first run (Appendix B) processed 13 programs, three of
which were input twice. Included in this run is a
printout of the triangular matrix that contains
correlation values of the pairs of programs. This matrix
is not printed in a production model of Accuse.

Below  we  illustrate  the  computation  of  the 
correlation  number  for  a  pair  of  programs  in  the  first 
run.  Before  proceeding,  it  is  necessary  to  note  the 
following  "window"  sizes  and  "importance"  factors  for 
each  of  the  correlation  parameters: 

parameter                          window size   importance factor

1. total operators                      5               6
2. total operands                       5               6
3. unique operators                     3               5
4. unique operands                      3               5
5. code lines                           3               5
6. declared variables (and used)        2               3
7. control statements                   1               2

The correlation number for the pair of programs
T102 and T107 (see Appendix B, p. 32) is computed as
follows:
1. total operators: T107 - T102 = 8

Eight is greater than the window size for
this parameter, hence these are not "affected"
programs.

2. total operands: T107 - T102 = 16

Again, these are not "affected" programs.

3. unique operators: T107 - T102 = 1

These programs are now within the window
size, and an increment is calculated for this
pair of programs:

increment = 5 - (25 - 24) = 4
correlation number = 4

4. unique operands: T102 - T107 = 0

increment = 5 - (13 - 13) = 5
correlation number = 9

5. code lines: T102 - T107 = 1

increment = 5 - (64 - 63) = 4
correlation number = 13

6. declared variables (and used): T107 - T102 = 0

increment = 3 - (11 - 11) = 3
correlation number = 16

7. control statements: T102 - T107 = 0

increment = 2 - (4 - 4) = 2
correlation number = 18
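The computation above can be reproduced with a short Python sketch. The function and variable names are ours; the window sizes, importance factors, and the seven differences are the ones listed in this chapter for T102 and T107. The sketch works directly from the differences, since the text gives only the differences (not the raw counts) for the first two parameters.

```python
# (parameter, window size, importance factor), in the order of the worked example.
PARAMETERS = [
    ("total operators", 5, 6),
    ("total operands", 5, 6),
    ("unique operators", 3, 5),
    ("unique operands", 3, 5),
    ("code lines", 3, 5),
    ("declared variables (and used)", 2, 3),
    ("control statements", 1, 2),
]

def correlation_number(differences):
    """Sum the increments for every parameter whose difference is within its window."""
    total = 0
    for (_name, window, importance), diff in zip(PARAMETERS, differences):
        diff = abs(diff)
        if diff <= window:  # the pair is "affected" for this parameter
            total += importance - diff
    return total

# Differences between T102 and T107, one per parameter, as given in the text.
t102_t107 = [8, 16, 1, 0, 1, 0, 0]
result = correlation_number(t102_t107)

# Two programs with identical counts reach the maximum correlation number.
maximum = correlation_number([0] * 7)
```

The maximum obtained this way is the sum of the seven importance factors, 6 + 6 + 5 + 5 + 5 + 3 + 2 = 32, which agrees with the maximum correlation number stated in Chapter V.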

The second listing (Appendix C) is a production
run of Accuse. There were 137 input programs consisting
of 13,374 lines of code. Accuse processed the code on a
CDC machine at a cost of $12.32. It required:

FL  TO  LOAD  110700  FL  TO  RUN  77100 

89.956  CP  SECS  105237B  CM  USED 

The maximum number of asterisks printed in the
distribution graph is 40; hence the "flat" distribution.

Accuse prints all pairs of programs with
correlation number greater than or equal to 28, though 29 is the
number that indicates the possibility of plagiarism.


CHAPTER  VII 


ANALYSIS  OF  RESULTS 

Effectiveness 

A  question  that  arises  is,  "What  are  the  chances 
that  two  programs  will  be  declared  similar  when  they  have 
been  independently  written?"  A  similar  question  is,  "How 
many  programs  can  Accuse  accept  before  so  many  programs 
are suspected of plagiarism that Accuse's results
become  unacceptable?" 

These questions are not addressed by Ottenstein.
He analyzes his findings and concludes that the way he
categorizes the input programs results in a somewhat
normal distribution, in agreement with our intuition.

He makes the observation that if two programs are
suspected of being similar (because they have the same N
value), the odds that they are similar are greater if
the correlation number occurs at one of the extreme
values of N. He concludes that any correlation function
that one could derive that produces a constant
distribution would not be accurate or necessarily desirable
because, in general, meaningful measurements of human
behavior produce uneven distributions [4].

I  see  two  aspects  to  these  questions.  The  first 
addresses  the  size  of  the  problem  being  solved.  A 
These larger ranges imply the occurrences of
lower correlation numbers. The frequency distribution
graph tells us six pairs of programs have a correlation
number of 28 or higher.

These observations appeal to our intuition. The
wider the ranges, the lower the correlation numbers, and
vice versa.

Another attractive conjecture is that the more
input programs, the higher the correlation numbers
generated. In our examples above, our expectation is
incorrect. The first set of data, in which nine pairs of
programs correlate at 28 or higher, has 43 input programs.
The second has 137 input programs, and only six pairs of
programs correlate at 28 or higher.

We make three assertions: (1) that a simple and
short program is going to generate more pairs of programs
with high correlation numbers than will a more difficult
and longer program when both generate the same number of
pairs of programs, (2) that the number of programs that
Accuse can accept before its results are unacceptable is
a function of both the number of input programs and the
complexity and length of those programs, and (3) that the
more independent correlation parameters, the lower the
correlation numbers.

The first two have already been argued. The third
can be argued as follows: let us consider the seven
correlation parameters as independent events; for each
parameter, one can calculate a theoretical probability
that two programs will have the same value; multiplying
these seven probabilities together will give the
theoretical probability that two programs will have the same
value for every parameter; removing any of the given
parameters will clearly increase this product, hence
increasing the likelihood of two programs having a
maximum correlation number.
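The argument can be checked numerically. In the sketch below the seven probabilities are hypothetical values chosen only for illustration; the point is that each factor is less than one, so dropping any factor makes the product larger.

```python
import math

def joint_probability(probabilities):
    """Probability that independent events all occur: the product of their probabilities."""
    return math.prod(probabilities)

# Hypothetical per-parameter probabilities that two programs share a value.
probs = [0.1, 0.2, 0.05, 0.1, 0.1, 0.25, 0.5]

all_seven = joint_probability(probs)
without_last = joint_probability(probs[:-1])
```

Here removing the last parameter (probability 0.5) doubles the chance of a match on every remaining parameter.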

When  Plagiarism  Occurs 

Available  data  supports  the  selection  of  29  as 
the  number  that  suggests  plagiarism.  This  choice  was 
made  through  observation,  and  is  by  no  means  absolute. 

The  interesting  point  of  analyzing  our  data  is 
that  we  can  look  at  it  from  two  different  aspects.  The 
first  is  as  above,  where  we  viewed  the  results  in  terms 
of  the  individual  parameters.  Bulut  makes  the  statement 
that the probability of using n1 and n2 exactly N1 and N2
times  in  two  different  algorithms  is  very  slim  [1] .  Both 
our  results  and  Ottenstein's  results  verify  his  assertion. 

The second way to view our results comes from the
manner in which we categorize or "fingerprint" the input
programs. Ottenstein uses N to categorize his input
programs, and it is the distribution that N creates that
Ottenstein analyzes. We categorize our programs using a
correlation number, and if we analyze the distribution
created by our correlation numbers, we come to somewhat
the same conclusions.

First,  the  correlation  numbers  create  a  somewhat 
normal  distribution,  though  they  appear  not  to  fit  any 
"standard"  distributions  [7]. 

Second, by the way we have built our correlation
scheme, two programs are declared similar only if the
correlation number occurs at an extreme value of the
distribution. In Ottenstein's categorization, two
programs can be declared similar in the center of his
distribution. Hopefully, then, our correlation scheme is
better.

Finally,  since  the  distribution  created  by  our 
correlation  scheme  is  not  a  uniform  one,  it  is  likely  to 
be  an  accurate  measurement  of  human  behavior  [4]. 

Looking  at  the  data  from  this  viewpoint,  it  would 
be  nice  to  have  a  verification  of  our  selection  of  29  as 
a  choice  for  the  number  that  suggests  plagiarism.  J.W. 
Tukey  suggests  a  way  to  analyze  distributions  that  fit 
no  standard  distributions  [6].  This  analysis  fits  well 
with  our  desire. 

He suggests taking two "hinges," one each at the
midpoints between the outer edges and the median of the
distribution (these hinges correspond to the quartiles).
He defines one and one half times the difference between
the values that occur at these points as a "step."
Finally, any values that occur beyond the value at these
hinges plus two steps (called the "outerfences") are
considered unreasonable.

For a hypothetical example, then, if the lower
hinge occurs at 14 and the upper hinge at 17, our
outerfence occurs at

17  +  2  *  (1.5*(17-14))  =  26 

and  any  correlation  number  greater  than  26  is  considered 
unreasonable;  or,  in  our  application,  considered 
plagiarism. 
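The outerfence arithmetic can be sketched as follows. The function names are ours, and the hinge rule used here (the median of each half of the sorted data, with the overall median included in both halves when the count is odd) is an assumption of this sketch rather than a description of what Accuse computes.

```python
from statistics import median

def hinges(values):
    """Return (lower hinge, upper hinge) of the data.

    Assumed rule: median of each half, sharing the middle value when the
    number of values is odd.
    """
    data = sorted(values)
    half = (len(data) + 1) // 2
    return median(data[:half]), median(data[-half:])

def outer_fence(lower_hinge, upper_hinge):
    """Upper outerfence: the upper hinge plus two steps of 1.5 times the hinge spread."""
    step = 1.5 * (upper_hinge - lower_hinge)
    return upper_hinge + 2 * step

# The hypothetical hinges from the text.
fence = outer_fence(14, 17)
```

Only the upper outerfence is computed, since only unusually high correlation numbers are of interest here.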

Accuse has been altered to compute this value;
test results, though inconclusive, are encouraging.

Though the fourth listing (Appendix E) provided gives a
number of 27 as being the outerfence (hence 28 implies
plagiarism), it is easy to see that there are no programs
that are beyond the outerfence. One can conclude that in
this case, 29 is as good a guess as the computed 28.

Computing the probability that two programs would
have  the  same  value  for  a  given  parameter  was  discussed 
earlier.  This  computation  could  lead  to  supplying  the 
user  with  some  additional  information  that  will  help  him 
in  his  judgement  as  to  whether  or  not  plagiarism  has 
occurred.  If  we  look  at  the  fourth  listing,  we  can  make 
some  observations. 

First, let us make the assumption that for each
range of values for a given parameter, each value has an
equal likelihood of occurring. Second, let us arbitrarily
throw away the largest and smallest values of each
parameter. Then, for each range of values observed, we
can calculate the number of expected pairs of programs
with equal values for that given parameter. Let us begin
with TOTAL OPERS. Range = 151 - 67 + 1 = 85. Any two
programs written independently will be assumed to have a
total operator count of between 67 and 151, and the
probability of both having any one particular value
is 1/85 * 1/85 = 1/7225. The probability of their having
the same value, summed over the entire range of 85 values, is
1/7225 + 1/7225 + . . . + 1/7225 = 1/85. Given that
there are 31 input programs, and hence (31 * 30)/2 = 465
pairs of programs, one can expect 5.5, or approximately
six pairs of programs to have equal values. We observe
four. Following this through, we can calculate expected
versus observed pairs for every parameter:

TOTAL OPERS

expected = 465/85 = 5.5 (rounded up to 6)
observed = 4

TOTAL OPNDS

expected = 465/62 = 7.5 (rounded up to 8)
observed = 11

UNIQ OPERS

expected = 465/7 = 66.4 (rounded up to 67)
observed = 73

UNIQ OPNDS

expected = 465/24 = 19.4 (rounded up to 20)
observed = 24

CODE LINES

expected = 465/45 = 10.3 (rounded up to 11)
observed = 19

DECL VARS

expected = 465/25 = 18.6 (rounded up to 19)
observed = 21

FOR STMTS

expected = 465/6 = 77.5 (rounded up to 78)
observed = 104
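The expected values above follow from two small formulas, sketched here in Python. The function names are ours; the inputs are the number of programs (31) and the parameter ranges quoted in the text, under the stated assumption that every value in a range is equally likely.

```python
import math

def pair_count(n_programs):
    """Number of distinct pairs among n programs: n * (n - 1) / 2."""
    return n_programs * (n_programs - 1) // 2

def expected_equal_pairs(n_programs, range_size):
    """Expected pairs with equal values when each of range_size values is equally likely."""
    return pair_count(n_programs) / range_size

pairs = pair_count(31)
total_opers = expected_equal_pairs(31, 85)   # TOTAL OPERS, range 85
total_opnds = expected_equal_pairs(31, 62)   # TOTAL OPNDS, range 62
uniq_opers = expected_equal_pairs(31, 7)     # UNIQ OPERS, range 7

# The whole-number figures in the listing correspond to rounding up.
rounded = [math.ceil(x) for x in (total_opers, total_opnds, uniq_opers)]
```

Comparing these expectations with the observed pair counts is what lets a reader judge whether a parameter shows more agreement than chance alone would suggest.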

From the results, it appears not to be unreasonable to
assume that all values are equally likely. A statistician,
then, can calculate these values and make a judgement as
to whether it appears that plagiarism occurred for any
parameter. Doing this for every parameter would allow one
to conjecture if plagiarism occurred over all parameters
and hence over an entire program. Coming up with some
final probability that plagiarism occurred for the input
programs would contribute to the successful use of Accuse.

Side  Issues 

One  of  the  most  revealing  aspects  of  this  research 
has  been  the  often  enormous  variations  in  the  measured 
parameters.  It  is  incredible  to  think  that  two  programs 
as  analyzed  by  Accuse  could  possibly  solve  the  same 
problem.  This  gives  rise  to  a  suggested  alternate  use  of 
Accuse. 

Accuse, modified appropriately, could measure the
"goodness" of a program. Its analysis could identify both
excesses (for example, the programmer used an excessive
number of variables) and shortcomings (for example,
the programmer used few comments). Accuse is also
capable of identifying variables declared and not used.
This information could allow a grader to make a
quantitative analysis of any program at a glance and grade the
program accordingly.
CHAPTER VIII

CONCLUSION 


The sabotaged programs given as input to Accuse
show  that  it  cannot  stand  alone  as  a  detector  of 
plagiarism,  but  must  in  fact  be  part  of  a  larger  system. 
This  system  should  be  one  that  retrieves  the  student's 
program,  compiles  it,  runs  it  on  data  the  student  has 
never  seen,  and  then  sends  the  student's  program  into  a 
file  that  will  eventually  be  processed  through  Accuse. 

Accuse  accomplished  its  goal  of  being  inexpensive 
to  use.  Results  were  actually  better  than  expected. 

Finally,  Accuse  needs  to  be  put  into  production 
use  to  verify  or  reject  assertions  made  here. 
APPENDIX A


DEFINITION  OF  PARAMETERS 